Count-Based Exploration with the Successor Representation
In this paper we introduce a simple approach for exploration in reinforcement
learning (RL) that allows us to develop theoretically justified algorithms in
the tabular case but that is also extendable to settings where function
approximation is required. Our approach is based on the successor
representation (SR), which was originally introduced as a representation
defining state generalization by the similarity of successor states. Here we
show that the norm of the SR, while it is being learned, can be used as a
reward bonus to incentivize exploration. In order to better understand this
transient behavior of the norm of the SR we introduce the substochastic
successor representation (SSR) and we show that it implicitly counts the number
of times each state (or feature) has been observed. We use this result to
introduce an algorithm that performs as well as some theoretically
sample-efficient approaches. Finally, we extend these ideas to a deep RL
algorithm and show that it achieves state-of-the-art performance in Atari 2600
games in the low sample-complexity regime.
Comment: This paper appears in the Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020).
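To make the idea concrete, here is a minimal tabular sketch of using the norm of a partially learned SR as an exploration bonus. The chain environment, step sizes, and the `beta`/`eps` constants are illustrative choices, not taken from the paper:

```python
import numpy as np

def td_sr_update(psi, s, s_next, alpha=0.1, gamma=0.99):
    """One TD update of a tabular successor representation (SR)."""
    e = np.zeros(psi.shape[0])
    e[s] = 1.0  # one-hot indicator of the current state
    psi[s] += alpha * (e + gamma * psi[s_next] - psi[s])

def exploration_bonus(psi, s, beta=0.1, eps=1e-6):
    # While the SR is still being learned, the rows of rarely visited
    # states have small norm, so the inverse norm acts as a novelty bonus.
    return beta / max(np.linalg.norm(psi[s], ord=1), eps)

# Random walk on a 5-state ring: as states are visited, their SR rows
# accumulate mass and their bonus shrinks relative to unvisited states.
n = 5
psi = np.zeros((n, n))
rng = np.random.default_rng(0)
s = 0
for _ in range(500):
    s_next = (s + rng.choice([-1, 1])) % n
    td_sr_update(psi, s, s_next)
    s = s_next
```

The bonus here plays the same role as a count-based bonus: states whose SR rows are still close to zero (rarely observed) yield a large reward, which is the transient behavior the abstract refers to.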
Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
In this paper we revisit the method of off-policy corrections for
reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this
method, online updates to the value function are reweighted to avoid divergence
issues typical of off-policy learning. While Hallak et al.'s solution is
appealing, it cannot easily be transferred to nonlinear function approximation.
First, it requires a projection step onto the probability simplex; second, even
though the operator describing the expected behavior of the off-policy learning
algorithm is convergent, it is not known to be a contraction mapping, and may
therefore be unstable in practice. We address these two issues by
introducing a discount factor into COP-TD. We analyze the behavior of
discounted COP-TD and find it better behaved from a theoretical perspective. We
also propose an alternative soft normalization penalty that can be minimized
online and obviates the need for an explicit projection step. We complement our
analysis with an empirical evaluation of the two techniques in an off-policy
setting on the game Pong from the Atari domain where we find discounted COP-TD
to be better behaved in practice than the soft normalization penalty. Finally,
we perform a more extensive evaluation of discounted COP-TD in 5 games of the
Atari domain, where we find performance gains for our approach.
Comment: AAAI 201
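As a rough illustration of the discounted variant, here is a minimal tabular sketch of a COP-TD-style ratio update with a discount factor. The toy dynamics, policies, and step sizes are illustrative, not from the paper:

```python
import numpy as np

def discounted_cop_td_update(c, s, a, s_next, pi, mu, alpha=0.1, gamma_hat=0.95):
    """One discounted COP-TD update of the state-distribution ratio estimate c.

    pi and mu are (state, action) probability tables for the target and
    behavior policies; gamma_hat < 1 is the discount introduced to make
    the expected operator better behaved.
    """
    rho = pi[s, a] / mu[s, a]  # importance ratio pi(a|s) / mu(a|s)
    target = gamma_hat * rho * c[s] + (1.0 - gamma_hat)
    c[s_next] += alpha * (target - c[s_next])

# Sanity check: when pi == mu (on-policy), the true ratio is 1 in every
# state, so the estimate should settle near 1.
n_states, n_actions = 2, 2
pi = mu = np.full((n_states, n_actions), 0.5)
c = np.zeros(n_states)
rng = np.random.default_rng(1)
s = 0
for _ in range(5000):
    a = int(rng.integers(n_actions))
    s_next = a  # toy deterministic dynamics: action a leads to state a
    discounted_cop_td_update(c, s, a, s_next, pi, mu)
    s = s_next
```

The discount `gamma_hat` blends the bootstrapped ratio with the constant 1, which is what pulls the iterates toward a well-behaved fixed point; the soft normalization penalty mentioned above is an alternative way to keep the estimates near a valid distribution without an explicit projection.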
Increasing the Action Gap: New Operators for Reinforcement Learning
This paper introduces new optimality-preserving operators on Q-functions. We
first describe an operator for tabular representations, the consistent Bellman
operator, which incorporates a notion of local policy consistency. We show that
this local consistency leads to an increase in the action gap at each state;
increasing this gap, we argue, mitigates the undesirable effects of
approximation and estimation errors on the induced greedy policies. This
operator can also be applied to discretized continuous space and time problems,
and we provide empirical results evidencing superior performance in this
context. Extending the idea of a locally consistent operator, we then derive
sufficient conditions for an operator to preserve optimality, leading to a
family of operators which includes our consistent Bellman operator. As
corollaries we provide a proof of optimality for Baird's advantage learning
algorithm and derive other gap-increasing operators with interesting
properties. We conclude with an empirical study on 60 Atari 2600 games
illustrating the strong potential of these new operators.
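To make the gap-increasing idea concrete, here is a minimal tabular sketch comparing the standard Bellman optimality operator with Baird's advantage-learning operator, one member of the gap-increasing family discussed above. The two-state MDP, discount, and `alpha` below are illustrative choices, not from the paper:

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """Standard optimality operator: (TQ)(s,a) = R(s,a) + gamma * E[max_b Q(s',b)]."""
    return R + gamma * P @ Q.max(axis=1)

def advantage_learning(Q, P, R, gamma, alpha=0.5):
    # Baird's advantage-learning operator, a gap-increasing operator:
    # (T_AL Q)(s,a) = (TQ)(s,a) - alpha * (max_b Q(s,b) - Q(s,a)),
    # which lowers suboptimal action values while leaving the greedy
    # action's value (where the penalty is zero) untouched.
    return bellman(Q, P, R, gamma) - alpha * (Q.max(axis=1, keepdims=True) - Q)

# Toy deterministic MDP: taking action a moves to state a.
S, A = 2, 2
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        P[s, a, a] = 1.0
R = np.array([[1.0, 0.0],
              [0.5, 0.2]])
gamma = 0.9

Q_std = np.zeros((S, A))
Q_al = np.zeros((S, A))
for _ in range(300):
    Q_std = bellman(Q_std, P, R, gamma)
    Q_al = advantage_learning(Q_al, P, R, gamma)

def gap(Q):
    # Action gap: value of the best action minus the runner-up (2 actions).
    return Q.max(axis=1) - Q.min(axis=1)
```

Iterating both operators to their fixed points, the greedy policy is the same (the operator is optimality-preserving), but the action gap under advantage learning is larger; with this `alpha` the suboptimal action is pushed down enough that the gap roughly scales by 1/(1-alpha), making the greedy policy more robust to approximation error.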